A New Frequent Similar Tree Algorithm Motivated by DOM Mining - Using RTDM and its New Variant - SiSTeR

Authors

  • Omer Barkol
  • Ruth Bergman
  • Shahar Golan
Abstract

The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). A restricted tree edit distance metric, called the Restricted Top Down Metric (RTDM), is the most suitable for DOM mining and is also less computationally expensive than the general tree edit distance. Given two trees of sizes n1 and n2, current methods take O(n1 · n2) time to compute RTDM. Consider, however, looking for patterns that form subtrees within a web page with n elements. The RTDM must be computed for all subtrees, and the running time becomes O(n^4). This paper proposes a new algorithm that computes the distance between all the subtrees in a tree in O(n^2) time, which enables us to obtain better quality as well as better performance on a DOM mining task. In addition, we propose a new tree edit distance, SiSTeR (Similar Sibling Trees aware RTDM). This variant of RTDM handles the case where repetitious (very similar) subtrees appear in different quantities in two trees that should nevertheless be considered similar.

External Posting Date: June 28, 2012. Approved for External Publication. Internal Posting Date: June 28, 2012. Copyright 2012 Hewlett-Packard Development Company, L.P.

HP Labs, Technion City, Haifa, Israel. {omer.barkol, ruth.bergman, shahar.golan}@hp.com
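The complexity claims in the abstract can be made concrete with a small sketch. The code below is not RTDM and not the authors' algorithm; it assumes a simpler Selkow-style top-down distance in which a root relabel costs 1 and inserting or deleting a child costs the size of that child's subtree. The names `Node`, `top_down_distance`, and `all_pairs_distances` are hypothetical and chosen only for illustration. It shows one pairwise computation, plus a cache shared across all node pairs so that each pair of subtrees is solved once rather than recomputed from scratch.

```python
class Node:
    """A labelled ordered tree node (a stand-in for a DOM element)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)
        # Subtree size is used as the cost of inserting/deleting the subtree.
        self.size = 1 + sum(c.size for c in self.children)


def top_down_distance(a, b, cache):
    """Selkow-style top-down distance between the subtrees rooted at a and b.

    Assumption: unit relabel cost, subtree-size insert/delete cost.
    Results are memoised in `cache`, keyed by node identity.
    """
    key = (id(a), id(b))
    if key in cache:
        return cache[key]
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cost of editing the first i children of a
    # into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + a.children[i - 1].size
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + b.children[j - 1].size
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + ca.size,  # delete subtree ca
                dp[i][j - 1] + cb.size,  # insert subtree cb
                dp[i - 1][j - 1] + top_down_distance(ca, cb, cache),
            )
    result = (0 if a.label == b.label else 1) + dp[m][n]
    cache[key] = result
    return result


def all_subtrees(root):
    """Return every node of the tree; each node is the root of one subtree."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out


def all_pairs_distances(root):
    """Distance between every pair of subtrees, sharing one cache."""
    cache = {}
    nodes = all_subtrees(root)
    return {(u, v): top_down_distance(u, v, cache) for u in nodes for v in nodes}


# A toy "page": a list with three rows, the last one slightly different.
row1 = Node("li", [Node("a")])
row2 = Node("li", [Node("a")])
row3 = Node("li", [Node("a"), Node("span")])
page = Node("ul", [row1, row2, row3])

distances = all_pairs_distances(page)
print(distances[(row1, row2)])  # identical rows
print(distances[(row1, row3)])  # rows differing by one extra child
```

Under these assumed costs, the work for a pair (u, v) is proportional to the product of their child counts, so with the shared cache the total over all pairs is bounded by the product of the two child-count sums, which is quadratic in the number of nodes. Recomputing each pair independently is what leads to the much larger bound the abstract mentions.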

Similar resources

A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining

Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...

A New Algorithm for High Average-utility Itemset Mining

High utility itemset mining (HUIM) is an emerging field in data mining which has gained growing interest due to its various applications. The goal of this problem is to discover all itemsets whose utility exceeds a minimum threshold. The basic HUIM problem does not consider the length of itemsets in its utility measurement, and utility values tend to become higher for itemsets containing more items...

PreRkTAG: Prediction of RNA Knotted Structures Using Tree Adjoining Grammars

Background: RNA molecules play many important regulatory, catalytic and structural ...

A New Routing Algorithm for Vehicular Ad-hoc Networks based on Glowworm Swarm Optimization Algorithm

Vehicular ad hoc networks (VANETs) are a particular type of mobile ad hoc network (MANET) in which the vehicles are considered as nodes. Rapid topology changes and frequent disconnections make it difficult to design an efficient routing protocol for routing data among vehicles. In this paper, a new routing protocol based on the glowworm swarm optimization algorithm is provided. Using the g...

MINING FUZZY TEMPORAL ITEMSETS WITHIN VARIOUS TIME INTERVALS IN QUANTITATIVE DATASETS

This research aims at proposing a new method for discovering frequent temporal itemsets in continuous subsets of a dataset with quantitative transactions. It is important to note that although these temporal itemsets may have relatively high support or occurrence within particular time intervals, they do not necessarily get similar support across the whole dataset, which makes i...

Publication date: 2011